Bayesian Models of Brains, Minds, & Behaviors 
DRCMR · Copenhagen · May 2025 
Ollie Hulme · David Meder · Janine Bühler · Melissa Larsen
    Amin Kangavari · Simon Steinkamp · Naiara Demnitz

DRCMR logo

Preamble

  • How the week will go
  • Materials
  • Schedule
  • Social 😊
  • Aims & expectations

Course overview

  • Basic modelling → Get started
  • Intermediate modelling → Go deeper
  • Integrate neural data → Bridge brain & behavior
  • Group presentations → Show what you built
  • lectures · interactive exercises · discussion · case studies · group work

Group work & presentations

  • Separate into small groups
  • Supervision from multiple experts
  • Pick question that requires cognitive modelling
  • Design experiment that collects behavioral & neural data
  • Design models, sketch graphical diagrams, plan analyses to test hypotheses & answer question
  • Present projects and share your work, ~15 mins on Friday
  • Audience questions & discussion

Materials: Interactive exercises

  • Run interactive python code via Binder
  • No installation needed → just open in browser and go 😌
  • Loading takes time → keep your tab open
  • binder screenshot

  • link in footer 👇

Materials: GitHub

  • All materials + code on GitHub
  • github screenshot

  • link in footer 👇

Materials: Book

  • book screenshot
  • We rely heavily on it, especially on days 1–3

Schedule

  • Up to date schedule
  • link in footer 👇

Social: WhatsApp

  • Chat with each other 💬
  • Ask & answer questions amongst yourselves 🗣️
  • Stay in touch ❤️
  • Friday bar! 😄
  • Join group via 👇

Overarching aim

  • Use the tools of probability theory to scientifically understand brains, minds, and behavior
  • New perspective. Upgrade your scientific thinking
  • Simplicity. Tools are parsimonious, flexible, intuitive, and powerful.
  • Universality. Same principles apply to science, statistics, brains, minds, behaviors, evolution, even physics.

Specific aims

  • Focus on mind & behavior → Cognitive models
  • Connect to neural data via simple methods 🧠
  • Stop describing 🥱 → Start explaining 😌
  • Engage. Hands-on · Interactive · Discursive
  • KISS. Keep It Simple, Stoopid 😜
  • Vibes. Intuitive · Playful · Fun · Useful ✨

Our expectations of you:

  • Life is short. Don’t let things pass that you don’t understand.
  • Your life matters. Your understanding matters. Actively take part.
  • You are entitled to understand this.
  • Don’t listen to it, fight it. 🥊
  • Disrupt. Ask questions, interrupt. 🙋‍♂️
  • Be sceptical. The truth is out there. 🤨

We love to hear:

  • “I’m probably being stupid…” 🤪
  • “I might have missed this…”
  • “I don’t understand why…”
  • “Do you have an intuition for why…”
  • “I’m confused…” 🤔
  • If you are shy, say it in the break or ask on WhatsApp

Don’t do this to yourself:

  • Nod and pretend you get something 🙅‍♂️
  • Assume you are the only one confused 😵‍💫😵
  • Collapse under self-doubt.

Interactive Gaussian

Basics of Bayesian Analysis

  • The spirit of Bayesian thinking
  • What is Bayesian modeling?
  • Principles of Bayesian inference
  • Observable vs. latent variables
  • Beliefs and evidence
  • Estimation methods
  • Why Bayesian methods?

Bayesian Thinking in Quotes


  • “Probability theory is nothing but common sense reduced to calculation.” — Laplace (1814)

  • “The rules of probability are the rules of consistent reasoning.” — Jaynes (2003)

  • “Bayesian methods are not a special brand of inference; they are the only logically consistent rules for inference that are known.” — Jaynes (2003)

Bayesianism as the Calculus of Common Sense



  • Bayesianism is just probability theory applied to inference. — Jaynes (2003)

From Probability to Rationality

  • Bayesian modeling follows the rules of probability.
  • Probability is logic.
  • Logic is consistency.
  • Consistency is rationality.

What is “Bayesian Modeling of Minds, Brains, and Behavior”?

  • This course is about thinking clearly about minds, brains, and behavior.
  • It’s about testing theories rationally, using the evidence provided by data.
  • Bayesian modeling offers a principled, rational way to update beliefs based on evidence.

General principle

  • Prior belief → Evidence → Posterior belief

Example of a cognitive task

  • e.g., go/no-go

Cognitive task example

  • 10 binary trials of equal difficulty
  • Estimate ability \(\theta\) from behavior
  • \(\theta\) is latent;
  • data are observed
  • e.g., correct responses \(k = 8\) out of \(n = 10\)

Latent vs. observed

Why latent variables?

  • We want to explain, not just describe
  • Descriptive: e.g. “What did the subject score on the task?”
  • Explanatory: e.g. “What ability produced that score?”
  • Psychology, neuroscience, and science in general ultimately seek explanations in terms of latent variables or processes

Why focus on the latent?

  • Scientific questions are causal:
    • Do Parkinson’s patients differ in risk-taking on and off medication?
    • Does serotonin change empathy?
    • Do alpha waves cause memory consolidation?
    • Does Ozempic improve cognitive flexibility?
  • In each of these examples we have observable data, but we want to infer something latent.
  • Bayesian modelling lets us infer what we can’t directly observe.

Back to the cognitive task

  • Observed: number correct (\(k/n\))
  • Latent: ability (\(\theta\))
  • Same \(\theta\) can result in different \(k/n\)
  • Different \(\theta\) can result in the same \(k/n\)
  • The relationship between them is probabilistic and therefore uncertain
  • Bayesian models allow us to infer latent variables from observed data, accounting for this uncertainty

Beliefs as Distributions

  • Probability distributions encode beliefs
  • The center reflects what is most likely
  • The spread reflects uncertainty

Interactive Gaussian

  • Play around with distributions to represent different beliefs
  • 📂 notebooks/gaussian_interactive_plot.ipynb

Interpreting probability distributions

  • See “interpreting probability distributions”
  • 📂 notebooks/gaussian_interactive_plot.ipynb

Prior over \(\theta\)

  • Let’s set our prior beliefs about \(\theta\)
  • What are the upper and lower limits of \(\theta\)?
  • Let’s play around with our prior beliefs
  • 📂 notebooks/beta_interactive_plot.ipynb

Bayes’ rule

  • \(p(\theta | D) = \frac{p(D|\theta) p(\theta)}{p(D)}\)
  • Posterior = \(\frac{\text{Likelihood} \times \text{Prior}}{\text{Marginal likelihood}}\)
  • This tells us how prior beliefs are updated by evidence
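As a minimal numerical sketch of Bayes’ rule in action (not part of the course notebooks; it uses scipy and a grid of candidate \(\theta\) values for the task example):

import numpy as np
from scipy.stats import binom

theta = np.linspace(0, 1, 101)          # grid of candidate ability values

prior = np.ones_like(theta)             # flat prior belief over theta
prior = prior / prior.sum()             # normalise so it sums to 1

likelihood = binom.pmf(8, 10, theta)    # p(D | theta) for k = 8 correct out of n = 10

posterior = likelihood * prior          # numerator of Bayes' rule
posterior = posterior / posterior.sum() # dividing by p(D) makes it a distribution

print(theta[np.argmax(posterior)])      # most believed value of theta, here 0.8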

Updating Beliefs with Evidence

  • We start with a prior belief about ability: \(p(\theta)\)
  • Then we observe data: e.g. \(k = 9\) correct out of \(n = 10\)
  • Bayes tells us how to update our beliefs:
  • \(\text{Prior} \quad \rightarrow \quad \text{Likelihood} \quad \rightarrow \quad \text{Posterior}\)
  • \(p(\theta) \quad \rightarrow \quad p(D|\theta) \quad \rightarrow \quad p(\theta|D)\)

Intuition behind belief updating

  • The more likely the data is for a given \(\theta\)
  • The more we believe in that value of \(\theta\) after seeing the data.
  • Values of \(\theta\) that are better supported by the data are believed in more strongly after seeing the data
  • We have updated our beliefs according to the data

Marginal likelihood \(p(D)\)

  • Normalizes the posterior
  • Just a single number that scales the posterior to ensure it is a probability distribution
  • Independent of \(\theta\)

Proportional form

  • \(p(\theta|D) \propto p(D|\theta) \times p(\theta)\)

Posterior = belief distribution

  • Posterior quantifies belief about \(\theta\) after observing the data

Analytic posterior

  • Prior: \(\text{Beta}(1,1)\)
  • Posterior: \(\text{Beta}(1+k, 1+n-k)\)
  • \(k\) = correct, \(n\) = total trials

Credible intervals

  • \(\text{BCI}_{95\%} [0.4, 0.8]\) → 95% chance \(\theta \in [0.4, 0.8]\)
  • Matches intuitive interpretation of intervals
  • See BCI section of 📂 notebooks/beta_interactive_plot.ipynb
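A quick sketch (using scipy directly rather than the notebook) of the analytic Beta posterior and its 95% credible interval for, say, \(k = 8\) of \(n = 10\):

from scipy.stats import beta

k, n = 8, 10
posterior = beta(1 + k, 1 + n - k)              # Beta(1+k, 1+n-k) under a Beta(1,1) prior

lower, upper = posterior.ppf([0.025, 0.975])    # central 95% credible interval
print(f"95% BCI: [{lower:.2f}, {upper:.2f}]")
print("posterior mean:", posterior.mean())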

Sequential updating (1)

  • New data: 3/5 correct
  • Combine with prior data (9/10 → 12/15)

Sequential updating (2)

  • Two-step: Prior → Posterior → Posterior
  • One-step: Prior → Combined data → Posterior
  • Final result is identical
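A tiny arithmetic check of this equivalence, using the counts from the slides (9/10 followed by 3/5):

# Two-step updating: Beta(1,1) prior, then 9/10, then a further 3/5
a, b = 1, 1
a, b = a + 9, b + (10 - 9)    # after the first data set
a, b = a + 3, b + (5 - 3)     # after the second data set
print(a, b)                   # Beta(13, 4)

# One-step updating with the combined data (12 correct out of 15)
print(1 + 12, 1 + (15 - 12))  # Beta(13, 4) -- identical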

Analytic posterior

  • Prior: \(\text{Beta}(1,1)\)
  • Posterior: \(\text{Beta}(1+k, 1+n-k)\)
  • \(k\) = correct, \(n\) = total trials

Interactive beta distribution

  • Adjust \(k\), \(n\) to see posteriors
  • Identify cases of high or low certainty

Conjugate priors

  • Prior/posterior in same distribution family = conjugate
  • Enables analytic updating
  • Not always available

When conjugacy fails: MCMC

  • MCMC = Markov Chain Monte Carlo
  • Approximate sampling from posterior
  • Robust when no analytic form exists

Analytic vs. sampling

  • Compare analytic posterior to MCMC samples
  • Both estimate posterior
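To make the comparison concrete, here is a hedged sketch of a random-walk Metropolis sampler for \(\theta\) (binomial likelihood, Beta(1,1) prior); it is illustrative only and not the sampler JAGS actually uses:

import numpy as np
from scipy.stats import beta, binom

rng = np.random.default_rng(0)
k, n = 8, 10

def log_post(theta):
    # log posterior up to a constant: flat prior + binomial log-likelihood
    if not 0 < theta < 1:
        return -np.inf
    return binom.logpmf(k, n, theta)

samples, theta = [], 0.5
for _ in range(20000):
    proposal = theta + rng.normal(0, 0.1)     # random-walk proposal
    if np.log(rng.uniform()) < log_post(proposal) - log_post(theta):
        theta = proposal                      # accept; otherwise keep current value
    samples.append(theta)

samples = np.array(samples[2000:])            # discard burn-in
print("MCMC posterior mean:    ", samples.mean())
print("analytic posterior mean:", beta(1 + k, 1 + n - k).mean())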

MCMC in practice

  • Two options:
    • Red pill: Learn MCMC internals
    • Blue pill: Accept it works, use it 😄

Why use Bayesian methods?

  • Flexible: Handles complexity, missing data
  • Principled: Represents uncertainty fully
  • Intuitive: Often aligns with reasoning
  • Accessible: Easy to apply once learned

Modelling a binary process

  • Start simple → binary processes
  • Most/many tasks are binary processes 😲

Getting started

  • Our first example was about inferring a rate for a binary process
  • Binary process since either correct or incorrect
  • Inferring a rate, since we were inferring the underlying ability as a probability \(\theta\)
  • Lots of tasks can be represented as binary processes

Cognitive tasks that are binary processes

  • Task switching
  • Go-no-go
  • Stop signal

From binary outcomes to rate

  • Binary outcomes typically k correct out of n
  • Expressed as a rate = k / n
  • This is the observed rate
  • But what is the underlying hidden rate \(\theta\)?
  • \(k \sim \text{Binomial}(\theta, n)\)
  • Assuming no history-dependent effects, where what happens in the past influences the present

From binary outcomes to rate

  • Observing \(k\) out of \(n\) trials allows us to update our beliefs about \(\theta\)
  • What we know, and what we don’t know, about variables is always represented by probability distributions
  • \(\theta \sim \text{Beta}(1,1)\)
  • \(k \sim \text{Binomial}(\theta, n)\)

Graphical models

  • Graphical models represent our probabilistic model
  • Nodes represent all the variables relevant to the problem
  • The graph structure represents dependencies: children depend on parents

Graphical notation

  • Circular - continuous, Square - discrete
  • Shaded - observed, Non-shaded - hidden
  • Single-boundary - stochastic, double-boundary - deterministic

Sampling via JAGS

  • JAGS code that would run this model
model{ 
  theta ~ dbeta(1,1)
  k ~ dbin(theta,n)
}
  • We sample from this model to obtain the posterior distribution of \(\theta\) for the data \(k = 5\), \(n = 10\)
  • [insert Fig.2.8]

R-hat as a convergence check

  • It’s important to check that the sampling has converged to the stationary distribution
  • One heuristic is \(\hat{R}\) (the potential scale reduction factor). Rule of thumb: \(\hat{R}\) should be between 1 and 1.01 for convergence to be achieved
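A rough sketch of the idea behind \(\hat{R}\) (the classic Gelman–Rubin formula; JAGS and modern samplers use refined split-chain variants):

import numpy as np

def r_hat(chains):
    # chains: array of shape (m_chains, n_samples)
    m, n = chains.shape
    B = n * chains.mean(axis=1).var(ddof=1)   # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()     # average within-chain variance
    var_plus = (n - 1) / n * W + B / n        # pooled variance estimate
    return np.sqrt(var_plus / W)

# two well-mixed chains should give R-hat very close to 1
rng = np.random.default_rng(1)
print(r_hat(rng.normal(size=(2, 5000))))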

footer: Learn more: Brooks & Gelman (1998).

Lecture 3: Parameter estimation

Posterior from the last example

  • We estimated the posterior distribution of \(\theta\) given the data
  • We sampled from this model
  • and we got this distribution

Conjugacy

  • Beta distributions have a natural interpretation
  • \(\theta \sim \text{Beta}(\alpha, \beta)\)
  • \(\alpha\) and \(\beta\) can be thought of as counts
  • \(\alpha = 1 + \text{successes} = 1 + k\)
  • \(\beta = 1 + \text{failures} = 1 + (n - k)\)

Why 1 + count?

  • Beta(1,1) is a uniform distribution, so it is equivalent to 0 counts
  • It is your belief distribution when you have no prior knowledge
  • Thus evidence accumulates from 1, via counting of successes and failures

Conjugacy

  • Notice that the update from the prior to the posterior just comes from adding the counts to the beta distribution
  • prior: Beta(1,1) → posterior: Beta(1+k, 1+(n-k))
  • prior: Beta(1,1) → posterior: Beta(1+ successCount, 1+ failureCount)
  • No sampling is needed because we have the equation
  • When the prior and posterior have the same type of distribution, they are “conjugate”

Conjugacy interactive example

  • Binder example: play with the beta distribution
  • No sampling needed; you just change the inputs to the beta distribution to compute the posterior

Difference between two rates

  • Suppose we have two processes, say two tasks, producing:
  • k1 successes out of n1 trials
  • k2 successes out of n2 trials
  • let’s assume they are generated by two different rates \(\theta_1\) and \(\theta_2\)
  • we want to know the difference in the rates \(\delta = \theta_1 - \theta_2\)
  • e.g. the effect of a drug on cognitive performance
  • e.g. the effect of age on task performance
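A minimal sketch of this inference using the conjugate posteriors and Monte Carlo samples (the counts below are made up purely for illustration):

import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(0)

# hypothetical counts for two conditions (e.g. on vs. off drug)
k1, n1 = 14, 20
k2, n2 = 9, 20

# conjugate posteriors for each rate under Beta(1,1) priors
theta1 = beta(1 + k1, 1 + n1 - k1).rvs(50000, random_state=rng)
theta2 = beta(1 + k2, 1 + n2 - k2).rvs(50000, random_state=rng)

delta = theta1 - theta2                     # posterior samples of the difference
print("posterior mean of delta:", delta.mean())
print("P(delta > 0):", (delta > 0).mean())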

Graphical model for inferring differences in rates

Sampling code for inferring differences in rates

Sampled posterior for difference in rates

Interpreting distributions

  • Probability mass functions are for discrete variables, which take a finite number of values
  • Probability density functions are for continuous variables, which take infinitely many values

Probability mass functions

  • insert left figure from box 3.2
  • all sum to 1

Probability density functions

  • insert right figure from box 3.2
  • area sums to 1
  • densities can exceed 1
  • only areas can be interpreted as probabilities
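A quick illustration (not from the book) that a density can exceed 1 while the area under it is still 1:

from scipy.stats import uniform

dist = uniform(loc=0, scale=0.5)        # Uniform(0, 0.5)

print(dist.pdf(0.25))                   # 2.0 -- a density value, not a probability
print(dist.cdf(0.5) - dist.cdf(0.0))    # 1.0 -- total area under the density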

Inferring a common rate

  • In some cases we want to infer a rate for 2 processes
  • e.g. same subject, same task, two different sessions
  • Here we would model with a single \(\theta\)

Same model with plate notation

  • note only one theta, multiple processes

Code for inferring a common rate

model{
  k1 ~ dbin(theta,n1)
  k2 ~ dbin(theta,n2)
  theta ~ dbeta(1,1)
}

Predicting data

  • We know 2 distributions already:
    1. Prior distribution of a parameter, e.g. \(p(\theta)\)
  • This is our prior prediction for what the parameters will be, prior to the data
    2. Posterior distribution of a parameter having observed the data, e.g. \(p(\theta \mid \text{data})\)
  • This is our posterior prediction for what the parameters will be, having seen the data
  • Remember: today’s posterior is tomorrow’s prior, so they can be a prediction for the next round of data, ad infinitum.

Predicting data

  • Priors and posteriors are over parameters. What about actually predicting data?
    1. Prior predictive distribution \(p(\text{data})\)
  • This is the prediction of what the data will be, based on the prior, before the data are seen.
    2. Posterior predictive distribution \(p(\text{new data} \mid \text{old data})\)
  • This is the prediction of what the data will be, based on the posterior, after the data have been seen.
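A minimal sketch of both predictive distributions for the beta-binomial example (illustrative values, \(k = 8\) of \(n = 10\)), drawing a parameter value first and then data from it:

import numpy as np
from scipy.stats import beta, binom

rng = np.random.default_rng(0)
k, n = 8, 10
draws = 50000

# prior predictive: theta from the prior, then data from the likelihood
theta_prior = beta(1, 1).rvs(draws, random_state=rng)
prior_pred_k = binom.rvs(n, theta_prior, random_state=rng)

# posterior predictive: theta from the posterior, then data from the likelihood
theta_post = beta(1 + k, 1 + n - k).rvs(draws, random_state=rng)
post_pred_k = binom.rvs(n, theta_post, random_state=rng)

print("prior predictive mean k:    ", prior_pred_k.mean())   # around n/2 under a flat prior
print("posterior predictive mean k:", post_pred_k.mean())    # pulled towards the observed k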

Prior and posterior prediction

model{
  k ~ dbin(theta,n)
  theta ~ dbeta(1,1)
  postpredk ~ dbin(theta,n)
  thetaprior ~ dbeta(1,1)
  priorpredk ~ dbin(thetaprior,n)
}

Samples of the four distributions

  • Note the prior and posterior are in the space of the parameter
  • And the prior and posterior predictive distributions are in the space of the data, k out of n trials.

Comparing data to the posterior predictive distribution

  • If we estimate the model along with its predictive distributions
  • We can see how this compares to the actual data
  • Here we see it’s not good. The model has poor descriptive adequacy
  • Why is it not a good model?
  • It’s a common-rate model, so it predicts the same rate for both, but the data are clearly better modelled with different rates.

Inference and time

  • prediction naturally suggests thinking about the future.
  • predictions can also apply to the past—for instance, when information is missing.
  • data can help infer hidden cognitive variables, which in turn may predict past behavior where information is incomplete.
  • similar to how historians reconstruct past events from limited evidence.
  • or a court of law infers guilt based on available testimony and facts.
  • we predict what might have been known in the past if more information had been available.
  • Inference allows prediction both forward and backward in time.

Today’s posterior is tomorrow’s prior

  • Since the posterior distribution over parameters can be continually updated, so can the posterior predictive distribution.
  • As the posterior is updated, so are its predictions for the data.
  • This can carry on forever.

Inferences with Gaussians

  • Due to the central limit theorem, data and parameters are frequently Gaussian
  • Gaussians have 2 parameters, a mean and a measure of their spread
  • Spread can be expressed as a variance, std, or a precision (1/var)

Graphical model for Gaussians

  • Simple model for inferring Gaussian with unknown mean and std

Interactive demo of Gaussian

  • Jupyter notebook - interactive plotting of gaussian

Sampling model for inferring Gaussians

model{
  for (i in 1:n){
    x[i] ~ dnorm(mu,lambda)
  }
  mu ~ dnorm(0,0.001)
  sigma ~ dunif(0,10)
  lambda <- 1/pow(sigma,2)
}

Repeated measures of IQ

  • Imagine taking a cognitive test like IQ multiple times
  • The mean is your IQ, and the spread models fluctuations in your performance, e.g. attention, fatigue, emotion, Venus orbiting Saturn
  • We can model this as a Gaussian for each person

Graphical model for IQ

  • What parameter is common to all subjects?
  • No index on the std. This means it is fixed.
  • Is this justified?
  • How to change it?

Sampling code for IQ

model{
  for (i in 1:n){
    for (j in 1:m){
      x[i,j] ~ dnorm(mu[i],lambda)
    }
  }
  sigma ~ dunif(0,100)
  lambda <- 1/pow(sigma,2)
  for (i in 1:n){
    mu[i] ~ dunif(0,300)
  }
}

Latent Mixture Models

  • A latent mixture model is a model that assumes data are generated from a mixture of different hidden (latent) processes
  • E.g. imagine a cognitive test where some people try their best, and others just guess
  • The scores are then generated by two mechanisms: ability and guessing
  • How would we model this as a mixture of processes?

Latent mixture model of cognitive test

  • To model the “tryers”, you model their ability as a rate of correct performance, as before.
  • To model the “guessers”, you model their performance as chance.

Latent mixture graphical model

  • Here z determines whether the person is a tryer or a guesser. Thus z is a parameter that models the mixture of processes that explain the data.
  • We know the probability of guessing is 0.5
  • The posterior distribution of z gives us the estimate of how many guessers there were.

Latent mixture code

model{
  for (i in 1:p){
    z[i] ~ dbern(0.5)
  }
  psi <- 0.5
  phi ~ dbeta(1,1)T(0.5,1)
  for (i in 1:p){
    theta[i] <- equals(z[i],0)*psi + equals(z[i],1)*phi
    k[i] ~ dbin(theta[i],n)
  }
}

Model selection

  • Model selection is perhaps the most persuasive reason to choose Bayesian methods.
  • Bayesian models provide a simple and principled way to choose between models, balancing accuracy and complexity

Why model comparison?

  • So far we have covered single models
  • In psychology/neuroscience we want to compare different models
  • Is this theory or that theory better at explaining the data?
  • If someone says “this is the theory”, it begs the question: compared to what?
  • We want to know how well a theory explains the data relative to another, or many others.
  • We need to compare models if we are to move beyond descriptive data analysis.

Ptolemy

  • We consider it a good principle to explain the phenomena by the simplest hypothesis, provided this doesn’t contradict the data in an important way

Occam’s razor

  • “Plurality must never be posited without necessity”
  • This is nicknamed Occam’s razor, which cuts out needless complexity when it comes to theories or models or explanations

Friston’s free energy

  • “Minimizing free energy means finding the right balance between accuracy and complexity”
  • “Good models should be as simple as possible, but not simpler than necessary to explain the data. Complexity must be minimized to avoid overfitting, yet sufficient to capture the underlying structure of the world.”

Geoff Hinton

  • “Model complexity must pay for itself”
  • Geoff is referring to the fact that model complexity must be justified by its ability to explain the data.

Bayesian magic

  • Lots of methods exist for trying to get this balance right
  • Out of sample prediction, parameter counting algorithms, blah, blah.
  • Bayes nails it by finding a universal principle by which to compare models that optimally balances accuracy and complexity.
  • This is the magic sauce of Bayes.

Marginal likelihood

  • Back to Bayes equation we started with
  • There is something missing which is that this is all conditional on a particular model. We could have a very different model that might have different parameters.

Marginal likelihood conditional on model

  • We can be explicit about how this is conditional on a specific model
  • \(p(D|M_1)\) is a single number, the marginal likelihood, also known as the evidence.

Marginal likelihood in words

  • The probability of the data given the model
  • The probability of the data according to the model’s predictions
  • The average predictive performance of the model
  • How unsurprised was the model when seeing the data
  • The prior in the model is like betting on where good parameters lie. The marginal likelihood sums up how well those bets paid off in terms of predictive performance when the data arrived.
  • A model’s predictions have a total probability mass of 1. It can spend its predictions in different ways by having different priors. A bad prior wastes this mass on predictions that are not supported by the data; a good prior concentrates its predictions where the data are well predicted.
  • The better the predictions of the model, the greater the evidence for the model
  • It’s like running every possible version of the model (with every parameter setting weighted by the prior), and asking: “on average, how well does the model predict the data?”

Marginal likelihood of paranormal octopi

  • Imagine two paranormal octopuses helping the KGB find a missing sub.
  • Alice predicts the sub is in the northern hemisphere
  • Bob predicts it is in northern Europe
  • Data: the sub is found in the Baltic.
  • The probability of the data according to Alice is reasonable; the evidence in favor of Alice is reasonable
  • The probability of the data according to Bob is high; the evidence in favor of Bob is high
  • Both were correct, but Bob was “more correct” because his prediction was more specific

Marginal likelihood

  • The marginal likelihood is computed by averaging the likelihood of the data across the model’s parameter space.
  • Prior probabilities act as averaging weights.
  • Based on the law of total probability:
    \(p(D \mid M_1) = \sum_{i=1}^{k} p(D \mid \xi_i, M_1) p(\xi_i \mid M_1)\)

Example calculation of marginal likelihood

    • Consider a model \(M_x\) with one parameter \(\xi\).
  • \(\xi\) can take three values:
    • \(\xi_1 = -1\), \(\xi_2 = 0\), \(\xi_3 = 1\).
  • Prior probabilities assigned:
    • \(p(\xi_1) = 0.6\)
    • \(p(\xi_2) = 0.3\)
    • \(p(\xi_3) = 0.1\)

Likelihood Computation

    • Likelihood values for observed data \(D\):
    • \(p(D \mid \xi_1) = 0.001\)
    • \(p(D \mid \xi_2) = 0.002\)
    • \(p(D \mid \xi_3) = 0.003\)
  • Compute marginal likelihood:
    \(p(D \mid M_x) = p(\xi_1) p(D \mid \xi_1) + p(\xi_2) p(D \mid \xi_2) + p(\xi_3) p(D \mid \xi_3)\) \(= 0.6 \times 0.001 + 0.3 \times 0.002 + 0.1 \times 0.003\) \(= 0.0015\)
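The same sum written out in code, as a one-line sanity check:

priors = [0.6, 0.3, 0.1]
likelihoods = [0.001, 0.002, 0.003]
print(sum(p * l for p, l in zip(priors, likelihoods)))   # 0.0015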

Marginal likelihood and complexity

  • To have high evidence, the model needs good predictive performance
  • It needs to focus its predictions where the data has a high likelihood
  • Complex models with many parameters distribute their predictions widely
  • Complex models thus make many predictions that may not turn out to have high likelihood
  • Complex models thus are more likely to “waste their predictions”

Model complexity is not the same as the number of parameters

  • Complexity is not just the number of parameters.
  • A model with few parameters can still be complex if its predictions are highly uncertain due to vague priors.
  • True complexity comes from how broadly the model spreads its predictions across its parameter space.
  • A narrow prediction distribution suggests a simpler model.
  • A wider distribution indicates greater complexity, as the model allows for more diverse possible outcomes.
  • Models are complex to the degree to which they make a broad range of predictions

Model complexity and vagueness of priors

  • a model with prior \(\theta \sim \text{Uniform} (0,1)\)
  • …is more complex than…
  • a model with prior \(\theta \sim \text{Uniform} (0.5,1)\)

Marginal likelihood as a unifying maximandum

  • Marginal likelihood is arguably the most important concept ever.
  • It is at the heart of scientific inference, psychological and neural inference, and survival, and even existence of objects.
  • Maximising it is hard, we often need to approximate it.
  • Variational Bayes, Free-energy minimisation, Predictive coding, MCMC Sampling etc. are all ways to approximate it.
  • Ultimately it is one quantity we care about.
  • In our context we want to build models of brain, mind or behavior, that best explain these phenomena.
  • This reduces to finding models that have the best average predictive performance.
  • And this reduces to finding models with the highest marginal likelihood
  • That’s it.
  • Everything else is details.
  • “Ah, but so-and-so maximises something different.”
  • Yes, but at heart that is just a proxy for the marginal likelihood.

The Bayes factor

  • Compared to what? This is a surprisingly powerful question to ask.
  • This model is good - it has high model evidence.
  • Ok great, but compared to what?
  • We want relative evidence. We want to compare predictive performance for one model versus another.
  • The Bayes factor gives us this by taking the ratio of the marginal likelihoods for two different models

Bayes factor equation

  • Marginal likelihood measures a model’s absolute evidence by averaging its predictive performance over all parameter values.
  • Model selection often focuses on relative evidence —how well one model explains the data compared to another.
  • This is quantified using the Bayes factor:
    \(BF_{12} = \frac{p(D \mid M_1)}{p(D \mid M_2)}\)

Interpretation of the Bayes Factor

  • BF > 1 → Data favors M₁ over M₂.
  • BF < 1 → Data favors M₂ over M₁.
  • Higher BF → Stronger evidence for the favored model.
  • If BF = 5, the data is 5 times more likely under M₁ than M₂.
  • If BF = 1/5, the data is 5 times more likely under M₂ than M₁.

Jeffreys’ Scale for Bayes Factor

Bayes Factor (BF₁₂) Interpretation
>100 Extreme evidence for M₁
30–100 Very strong evidence for M₁
10–30 Strong evidence for M₁
3–10 Moderate evidence for M₁
1–3 Anecdotal evidence for M₁
1 No preference
1/3–1 Anecdotal evidence for M₂
1/10–1/3 Moderate evidence for M₂
1/30–1/10 Strong evidence for M₂
1/100–1/30 Very strong evidence for M₂
<1/100 Extreme evidence for M₂

Example: Binomial Model

  • Suppose we have 9 correct (k) out of 10 trials (n) and compare two models:
  • M₁: A guessing model in which the rate is fixed at chance:
    \(\theta = 0.5\) under \(M_1\)
  • M₂: A non-guessing model with unknown success rate:
    \(p(\theta \mid M_2) \sim \text{Uniform}(0,1)\)
  • Compute marginal likelihoods for both models.

Bayes Factor for the example

  • Marginal likelihoods:
  • \(p(D \mid M_1) = \binom{10}{9} \left(\tfrac{1}{2}\right)^{10} \approx 0.0098\)
  • \(p(D \mid M_2) = \frac{1}{n+1} = \frac{1}{11} \approx 0.0909\)
  • Bayes factor calculation:
    \(BF_{12} = \frac{p(D \mid M_1)}{p(D \mid M_2)} \approx 0.107\)
  • If \(BF_{12} > 1\), M₁ is preferred; if \(< 1\), M₂ is preferred.
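A quick numerical check of these two marginal likelihoods (a sketch in scipy; the uniform-prior average is done by numerical integration):

from scipy.stats import binom
from scipy.integrate import quad

k, n = 9, 10

# M1: guessing model, theta fixed at 0.5
ml_m1 = binom.pmf(k, n, 0.5)

# M2: non-guessing model, average the likelihood over a Uniform(0,1) prior on theta
ml_m2, _ = quad(lambda theta: binom.pmf(k, n, theta), 0, 1)

print(ml_m1, ml_m2)               # ~0.0098 and ~0.0909
print("BF12:", ml_m1 / ml_m2)     # ~0.107
print("BF21:", ml_m2 / ml_m1)     # ~9.3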

Flipping the BF

  • The data are 0.107 times as likely under M₁ as under M₂
  • A Bayes factor below 1 is hard to interpret
  • Thus, when a BF is below 1 it is often more convenient to take the reciprocal, so that it is a value above 1
  • Here that is \(BF_{21} = 1/0.107 \approx 9.3\)
  • M₂ is favoured over M₁ by a factor of 9.3
  • The data are 9.3 times more likely under the model in which subjects are not guessing than under the one in which they are guessing

Bayes vs. Fisher

  • Bayesian methods are typically comparative, where two or more models are compared
  • Frequentist methods are not comparative, simply considering how unlikely the data are under the null.
  • Evidence is always comparative, e.g. the ratio p(data | Innocent) : p(data | Guilty)

Cautions and Critiques

  • Bayes factors sensitive to prior specification.
  • Must use careful prior selection and sensitivity analyses.
  • Point null hypotheses controversial: can they ever be exactly true?
    • Pragmatically, yes (experimental studies).
    • Conceptually, more debated.

Posterior model probabilities

  • Bayes factors compare the predictive performance of two different theories
  • But model plausibility also depends on prior beliefs in each model.
  • To assess relative plausibility after seeing the data we combine:
  • Predictive performance (likelihood) and Prior plausibility (prior probabilities)

Posterior odds

  • Relative plausibility of models after seeing the data is indicated by the posterior odds
  • The posterior odds of two models is: \(\frac{p(M_1 \mid D)}{p(M_2 \mid D)} = \frac{p(D \mid M_1)}{p(D \mid M_2)} \cdot \frac{p(M_1)}{p(M_2)}\)
  • Or in words: posterior odds = Bayes factor × prior odds
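A one-line numerical illustration (made-up numbers):

prior_odds = 1 / 5       # M1 judged five times less plausible than M2 a priori
bayes_factor = 10        # but the data are ten times more likely under M1
posterior_odds = bayes_factor * prior_odds
print(posterior_odds)    # 2.0 -- M1 is now twice as plausible as M2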

Bayes factor transforms prior odds into posterior odds

  • The Bayes factor transforms prior odds into posterior odds: \(\text{prior odds} \rightarrow \text{posterior odds}\)

Advantages of the Bayes factors

  • Bayesian hypothesis tests (e.g., Bayes factors) naturally implement Ockham’s razor, penalising complex models when comparing them.
  • They quantify relative support for multiple models.
  • Allow model-averaged predictions over parameters.
  • Bayes factors can provide evidence in favor of the null hypothesis (unlike frequentist)
  • Bayes factors can be forever updated as data arrive (no fixed sample size or plan required)
  • Easy to understand: how many times more likely the data are under one model than another. That’s it. What was the p-value again?

Extraordinary Claims Require Extraordinary Evidence

  • Echoing Hume, Laplace, and Sagan: extraordinary claims require extraordinary evidence.
  • This principle is baked into Bayesian methods through posterior odds:
    \(\text{Posterior odds} = \text{Bayes factor} \times \text{prior odds}\)
  • The prior plausibility of a model matters.
  • Even strong evidence may not overturn implausible claims.
  • Bayesian methods make explicit the influence of prior beliefs. That’s a good thing.

Problems with p-values

  • p-values often misinterpreted
  • ask your supervisor to define p-values. Most can’t.
  • p-values do not provide evidence for the null.
  • They depend on the experimenter’s intentions, on what they would have done if the data had turned out differently, and on the choice of alternative hypotheses.
  • Classical hypothesis testing is asymmetric:
  • Null can be rejected, but never confirmed.

Optional Stopping

  • Bayesian methods allow data collection to stop at any time based on evidence.
  • p-value methods require pre-specified stopping rules.
  • Researchers can continue or stop based on interim Bayes factor values.
  • Provides more flexibility and transparency.

Challenges for Bayesian Approach

  • Conceptual: Sensitivity to prior distributions.
  • Computational: Difficulty in calculating marginal likelihoods.
  • Using vague priors can lead to low-precision predictions.
  • The Ockham’s razor property penalizes overly vague models.
  • Priors must reflect meaningful knowledge — vague priors yield unhelpful results.
  • Vague priors entail complex models

Specifying Good Priors

  • Researchers must encode prior knowledge into prior distributions.
  • Options:
    • Subjective specification: Based on domain expertise.
    • Objective priors: E.g., unit-information priors, do not rely on specific prior knowledge.
  • Objective priors useful for:
    • Wide applicability.
    • Transparent baseline comparisons.
    • Refinement with specific information when needed.

Confusion About Priors

  • Common misconception: Bayes factor depends on the prior plausibility of the models.
  • In truth:
    • Bayes factor is unaffected by prior probabilities of the models.
    • It does depend on the prior over model parameters (e.g., effect size).
  • Important distinction:
    • Prior on models affects posterior probabilities but not the Bayes factor.
    • Prior on parameters affects the Bayes factor itself and hence the posterior odds.

Prior Sensitivity

  • When different parameter priors yield different conclusions, it shows scientific uncertainty.
  • Strategies to handle prior sensitivity:
    • Local/intrinsic/fractional/partial Bayes factors.
    • Sensitivity analyses: Vary the priors on parameters and observe effects on conclusions.
  • Reminder: Some models are robust, others are fragile to prior assumptions.

Computational Challenges

  • Computing marginal likelihoods is hard for complex models.
  • Approximate methods include:
    • Candidate’s formula.
    • Basic marginal likelihood identity: \(p(D \mid M_1) = \frac{p(D \mid \theta, M_1) p(\theta \mid M_1)}{p(\theta \mid D, M_1)}\)
  • MCMC-based methods:
    • Sample from posterior, evaluate likelihoods.
    • Use model indicator variables for model comparison.

Savage–Dickey method of model comparison

  • In this method two models are compared:
    • Null hypothesis (\(H_0\)): fixes parameter to a specific value, e.g., \(\phi = \phi_0\)
    • Alternative hypothesis (\(H_1\)): parameter free to vary, e.g., \(\phi \ne \phi_0\)
  • \(H_0\) is nested within \(H_1\) (by constraining parameter).
  • Classical null hypothesis usually sharp (point-null).

Savage–Dickey Density Ratio

  • Defines Bayes factor for nested models: \(BF_{01} = \frac{p(D \mid H_0)}{p(D \mid H_1)} = \frac{p(\phi = \phi_0 \mid D, H_1)}{p(\phi = \phi_0 \mid H_1)}\)
  • Simply the ratio of posterior to prior densities at the point of interest \(\phi_0\) under the alternative hypothesis.

Example: Binomial Scenario

  • Binomial scenario: \(\theta\) parameter, observing 9 correct and 1 incorrect response.
  • Null hypothesis (\(H_0\)): \(\theta = 0.5\)
  • Alternative hypothesis (\(H_1\)): \(\theta\) free to vary, prior \(\theta \sim Beta(1,1)\)
  • Bayes factor is the ratio of posterior and prior densities at \(\theta=0.5\)
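A small sketch of the ratio for this example, using the analytic posterior (in practice the course estimates it from MCMC samples, but the value is the same):

from scipy.stats import beta

k, n = 9, 10

prior = beta(1, 1)                  # theta ~ Beta(1,1) under H1
posterior = beta(1 + k, 1 + n - k)  # analytic posterior under H1

bf01 = posterior.pdf(0.5) / prior.pdf(0.5)   # Savage-Dickey ratio at theta = 0.5
print("BF01:", bf01)       # ~0.107, i.e. evidence against the point null
print("BF10:", 1 / bf01)   # ~9.3 in favour of the alternative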

Visual Interpretation of Savage-Dickey

  • Prior (uniform) and posterior distributions shown.
  • Density ratio at \(\theta=0.5\) gives Bayes factor.

MCMC-Based Estimation for Savage-Dickey

  • When analytical solutions are difficult, use MCMC:
  • Posterior and prior estimated from MCMC samples.
  • Heights of posterior and prior at the null point give Bayes factor.

Advantages of Savage–Dickey

  • Direct interpretation as density ratio.
  • Simplifies computation: no separate marginal likelihood calculation needed.
  • Works well for nested models.

Compare Gaussian means

  • Common task: test if two Gaussian means differ
  • Example: does glucose improve detection performance
  • Focus: test claim that glucose boost has larger effect in summer

Data

Season N Mean SD
Winter 41 0.11 0.15
Summer 41 0.07 0.23
  • Difference not significant
  • t = 0.79, p = 0.44

p-values and the Null Hypothesis

  • “From a null result, we cannot conclude that no difference exists…”
  • p = 0.44 does not support H₀
  • It just means data are not incompatible with H₀
  • Need a Bayes factor to quantify support for H₀

Bayes Factor Overview

  • The Bayes factor is the ratio that converts prior odds into posterior odds
  • Quantifies evidence for or against H₀
  • Unlike p-values, can support H₀

One-Sample Comparison Model

  • Test standardized difference scores (e.g., winter - summer)
  • Assume:
    • δ ~ Cauchy(0,1)
    • xᵢ ~ Gaussian(μ, 1/σ²)
    • μ = δσ

One-Sample Graphical Model

  • Fig 8.1
  • Prior on δ: Cauchy(0,1)
  • Prior on σ: Half-Cauchy
  • Estimate posterior with MCMC
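A minimal generative sketch of this model (simulating standardized difference scores from the priors; the actual analysis runs the other way, estimating \(\delta\) from data with MCMC). The scale of the Half-Cauchy is an assumption here:

import numpy as np
from scipy.stats import cauchy, halfcauchy, norm

rng = np.random.default_rng(0)
n = 41                                              # sample size as in the example

delta = cauchy(0, 1).rvs(random_state=rng)          # effect-size prior
sigma = halfcauchy(scale=1).rvs(random_state=rng)   # spread prior (assumed scale of 1)
mu = delta * sigma                                  # implied mean of the differences

x = norm(mu, sigma).rvs(n, random_state=rng)        # simulated standardized difference scores
print(delta, sigma, x.mean())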

Posterior vs Prior

  • Figure 8.2
  • Posterior peaks near δ = 0
  • Bayes Factor ≈ 5:1 in favor of H₀

Order-Restricted Model

  • SMM predicts δ < 0
  • Use order-restricted prior:
    • δ ~ Cauchy(0,1) truncated to (-∞, 0)

Updated Bayes Factor

  • Figure 8.4
  • Stronger evidence for H₀: BF ≈ 10:1

Summary

  • p-values can’t confirm H₀
  • Bayes factors can
  • “Evidence of absence” of support for SMM’s prediction

Two-Sample Comparison

  • Compare oxygenated vs plain water on memory
  • Two independent groups

Two-Sample Model Structure

  • Figure 8.5
  • Shared variance σ²
  • δ = α / σ
  • α = μₓ - μᵧ

Large Effect Example

Group N Mean SD
Plain Water 20 68.35 6.38
Oxygenated 20 76.65 4.06
t(38) = 4.47, p < .01

Two-Sample Bayes Factor

  • Figure 8.6
  • Posterior moves away from 0
  • BF ≈ 447:1 in favor of H₁
  • Decisive evidence for the oxygenated water effect

Comparing binomial rates

  • We will naturally compute binomial rates for different groups or conditions
  • And ask which is larger?
  • We thus need to compare binomial rates and test hypotheses about which is bigger etc.

Bayesian graphical model

  • Figure 9.1
  • graphical model for comparing two proportions

Bayesian model

  • We model the observed counts using binomial likelihoods and assign uniform Beta priors: s1 ~ Binomial(theta1, n1)
    s2 ~ Binomial(theta2, n2)
    theta1 ~ Beta(1, 1)
    theta2 ~ Beta(1, 1)
    delta <- theta1 - theta2
  • theta1: rate (probability of success) for group 1
  • theta2: rate (probability of success) for group 2
  • delta = theta1 - theta2: difference in proportions
  • We are interested in the posterior distribution of delta.

Model Code

Here is the model used for posterior simulation:

model{
  theta1 ~ dbeta(1,1)
  theta2 ~ dbeta(1,1)
  delta <- theta1 - theta2
  s1 ~ dbin(theta1,n1)
  s2 ~ dbin(theta2,n2)
  theta1prior ~ dbeta(1,1)
  theta2prior ~ dbeta(1,1)
  deltaprior <- theta1prior - theta2prior
}

This allows us to compare the prior and posterior density of delta at zero.

Prior and Posterior Distributions

  • Figure 9.2
  • We estimate the posterior distribution for the rate difference delta = theta1 - theta2 using Bayesian inference.
  • The left plot shows prior and posterior distributions for delta across its full range.
  • The right plot zooms in near delta = 0.
  • This is used in the Savage–Dickey density ratio to compute the Bayes factor.
  • The Savage–Dickey method compares: BF_01 = posterior density at delta = 0 / prior density at delta = 0
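A hedged sketch of that computation using conjugate posterior samples and a kernel density estimate at delta = 0 (the counts below are placeholders, not the book’s data):

import numpy as np
from scipy.stats import beta, gaussian_kde

rng = np.random.default_rng(0)

# placeholder counts for the two groups
s1, n1 = 60, 100
s2, n2 = 50, 100

# posterior samples of delta from the two conjugate Beta posteriors
theta1 = beta(1 + s1, 1 + n1 - s1).rvs(100000, random_state=rng)
theta2 = beta(1 + s2, 1 + n2 - s2).rvs(100000, random_state=rng)
delta_post = theta1 - theta2

# prior samples of delta from two independent Beta(1,1) priors
delta_prior = (beta(1, 1).rvs(100000, random_state=rng)
               - beta(1, 1).rvs(100000, random_state=rng))

# Savage-Dickey: ratio of smoothed density heights at delta = 0
bf01 = gaussian_kde(delta_post)(0.0)[0] / gaussian_kde(delta_prior)(0.0)[0]
print("BF01:", bf01)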

Interpreting the Bayes Factor

  • The posterior density at delta = 0 is about half the prior density.
  • This gives a Bayes factor ≈ 2 in favor of the alternative hypothesis H1: delta ≠ 0.
  • The 95% credible interval for delta is approximately [-0.09, 0.01], which only just includes 0.

Interpretation:

  • There is only modest evidence that one rate is higher than the other
  • The Bayes factor penalizes H1 for spreading prior mass over implausible values.